Add TMA TensorMapDescriptor support #1687

Open

rparolin wants to merge 19 commits into NVIDIA:main from rparolin:rparolin/tma_feature

Conversation

@rparolin (Collaborator) commented Feb 24, 2026

Summary

  • Add TensorMapDescriptor Cython class wrapping the CUDA driver's CUtensorMap for Hopper+ TMA (Tensor Memory Accelerator) bulk data movement
  • Support tiled and im2col descriptor creation via from_tiled() and from_im2col() class methods, with automatic dtype inference, stride computation, and validation
  • Integrate TensorMapDescriptor as a first-class kernel argument in _kernel_arg_handler.pyx
  • Add comprehensive tests (test_tensor_map.py) and an example (tma_tensor_map.py)

Closes #199
Closes #200

@copy-pr-bot (Contributor) commented Feb 24, 2026

Auto-sync is disabled for ready for review pull requests in this repository. Workflows must be run manually.


Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 4 comments.

Comments suppressed due to low confidence (2)

cuda_core/pixi.toml:67

  • Removing the cu12 environment from this subproject can break the repository’s top-level pixi run -e cu12 test workflow, which runs pixi run --manifest-path cuda_core test under the propagated PIXI_ENVIRONMENT_NAME=cu12. If cu12 testing is still expected at the workspace level, consider keeping a solvable cu12 environment here (e.g., using conda-forge cuda-bindings/cuda-version constraints instead of the path dependency) or updating the workspace test tasks to avoid selecting a missing environment.

# NOTE: cu12 environment is intentionally omitted because the path dependency
# to ../cuda_bindings (v13.1) makes it unsolvable locally. For cu12 testing,
# use conda-forge packages or CI workflows.
[environments]
default = { features = [
    "cu13",
    "test",
    "cython-tests",
], solve-group = "default" }
cu13 = { features = ["cu13", "test", "cython-tests"], solve-group = "default" }

cuda_core/cuda/core/_tensor_map.pyx:461

  • c_pixel_box_lower / c_pixel_box_upper are declared as fixed-size int[3] but only the first n_spatial entries are written. If the driver implementation reads all 3 entries (the API supports up to 3 spatial dims), the remaining uninitialized values can make encoding nondeterministic. Initialize the full arrays (e.g., set all 3 to 0 first) before filling the active elements.
        cdef uint64_t[5] c_global_dim
        cdef uint64_t[4] c_global_strides
        cdef uint32_t[5] c_element_strides
        cdef int[3] c_pixel_box_lower  # max 3 spatial dims (rank 5 - 2)
        cdef int[3] c_pixel_box_upper
        cdef int i_c

        for i_c in range(rank):
            c_global_dim[i_c] = <uint64_t>shape[rank - 1 - i_c]
            c_element_strides[i_c] = <uint32_t>element_strides[rank - 1 - i_c]

        for i_c in range(rank - 1):
            c_global_strides[i_c] = <uint64_t>byte_strides[rank - 2 - i_c]

        # Reverse spatial dimensions for lower/upper corners
        for i_c in range(n_spatial):
            c_pixel_box_lower[i_c] = <int>pixel_box_lower_corner[n_spatial - 1 - i_c]
            c_pixel_box_upper[i_c] = <int>pixel_box_upper_corner[n_spatial - 1 - i_c]
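The fix the review suggests, zero-filling the fixed-size arrays before writing the active entries, can be sketched in plain Python (an illustrative sketch, not the Cython change itself; the function name is made up):

```python
def pack_pixel_box(lower, upper, max_spatial=3):
    """Reverse spatial dims into fixed-size arrays, zero-filling unused slots.

    Zero-filling first makes the encoded bytes deterministic even if the
    driver reads all max_spatial entries rather than only n_spatial of them.
    """
    n_spatial = len(lower)
    assert len(upper) == n_spatial <= max_spatial
    c_lower = [0] * max_spatial
    c_upper = [0] * max_spatial
    for i in range(n_spatial):
        # mirror the Cython loop: dimensions are stored fastest-varying first
        c_lower[i] = lower[n_spatial - 1 - i]
        c_upper[i] = upper[n_spatial - 1 - i]
    return c_lower, c_upper
```

In the Cython code the same effect can be had by initializing all three slots of `c_pixel_box_lower` / `c_pixel_box_upper` to zero before the `range(n_spatial)` loop.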



Comment on lines +270 to +272
view = _get_validated_view(tensor)
desc._source_ref = tensor

Copilot AI commented Feb 25, 2026

TensorMapDescriptor stores _source_ref = tensor, but when tensor is a DLPack producer the pointer/metadata lifetime is governed by the DLPack capsule returned by __dlpack__(). Since the temporary StridedMemoryView (which holds the capsule and calls the deleter in __dealloc__) is not retained, the capsule can be released immediately, potentially invalidating globalAddress for exporters where the capsule owns the backing allocation. Store a strong reference to the StridedMemoryView (or at least its metadata capsule) instead of (or in addition to) the original tensor object.

@rparolin (Collaborator, Author) commented:
/ok to test


@leofang (Member) left a comment

There is a coordinated effort between C++ and Python: #199 (comment). Can we please look into reusing the C++ implementation (mainly because @fbusato is a TMA expert) and avoid re-implementing it if possible?

@fbusato commented Feb 27, 2026

Fighting with poor documentation and bugs doesn't make me an expert :).
Anyway, we provide two main functionalities in this direction:

The implementation of make_tma_descriptor is here https://github.com/NVIDIA/cccl/blob/main/libcudacxx/include/cuda/__tma/make_tma_descriptor.h. Please let me know if there are functionalities that need to be isolated for reuse.

@rparolin rparolin added the cuda.core Everything related to the cuda.core module label Mar 2, 2026
@rparolin rparolin requested a review from Copilot March 3, 2026 23:56
Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 1 comment.

Comments suppressed due to low confidence (3)

cuda_core/tests/test_tensor_map.py:102

  • This test passes a raw Buffer from dev.allocate() with data_type=FLOAT32. Buffer exports via DLPack as an int8 tensor with shape=(n_bytes,), so the TMA encoder will treat shape[0] as a float32 element count unless the implementation compensates for this. That can create a descriptor that covers 4× more memory than the allocation and hide potential out-of-bounds issues. Prefer wrapping the buffer in _DeviceArray(buf, (1024,), dtype=np.float32) (or StridedMemoryView.from_buffer with the intended shape/dtype) so the descriptor is built from element-count dimensions matching the data type.
        buf = dev.allocate(1024 * 4)  # 1024 float32 elements
        desc = TensorMapDescriptor.from_tiled(
            buf,
            box_dim=(64,),
            data_type=TensorMapDataType.FLOAT32,
        )
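The unit mismatch flagged here can be stated as a small helper (a sketch, not part of the PR; FLOAT32 has an itemsize of 4 bytes):

```python
def global_dim_elements(n_bytes, itemsize):
    """Convert a raw byte length to the element count a TMA global_dim expects.

    TMA global dimensions are expressed in elements of the descriptor's data
    type, so treating a byte length as an element count for FLOAT32 would
    describe 4x more memory than was actually allocated.
    """
    if n_bytes % itemsize:
        raise ValueError("byte length is not a multiple of the element size")
    return n_bytes // itemsize

# 4096 bytes of float32 storage is 1024 elements, not 4096.
```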

cuda_core/tests/test_tensor_map.py:277

  • Same issue as test_from_tiled_1d: building a descriptor from a raw Buffer with data_type=FLOAT32 relies on the implementation translating the buffer's byte-length into a float32 element count. To avoid encoding a descriptor with incorrect global_dim, wrap buf1/buf2 in _DeviceArray(..., dtype=np.float32) (or a StridedMemoryView with the intended dtype/shape) before calling from_tiled() / replace_address().
    def test_replace_address(self, dev, skip_if_no_tma):
        buf1 = dev.allocate(1024 * 4)
        desc = TensorMapDescriptor.from_tiled(
            buf1,
            box_dim=(64,),
            data_type=TensorMapDataType.FLOAT32,
        )

cuda_core/cuda/core/_kernel_arg_handler.pyx:305

  • Support for passing TensorMapDescriptor as a kernel argument is added here, but there’s no test exercising the full path (ParamHolder → cuLaunchKernel) with a real TensorMapDescriptor argument. Given cuda_core/tests/test_launcher.py already validates scalar/buffer argument handling, consider adding a small integration test that launches a kernel taking a CUtensorMap by value and verifies it can be consumed (or at least that the kernel receives the expected 128-byte payload). This will protect against ABI/size/alignment regressions in the argument marshalling logic.
            elif arg_type is tensor_map_descriptor_type:
                prepare_tensor_map_arg(self.data, self.data_addresses, <TensorMapDescriptor>arg, i)
                continue
            elif arg_type is bool:
                prepare_arg[cpp_bool](self.data, self.data_addresses, arg, i)


rparolin and others added 9 commits March 6, 2026 21:26
…time

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Remove unused _alloc_device_tensor helper from tests
- Add test for rank > 5 (6D tensor) to verify upper bound validation
- Add NULL check for PyMem_Malloc in prepare_tensor_map_arg

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Move the replace_address() demonstration into its own self-contained
example (tma_replace_address.py) so each file covers a single concept.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…alidated views alive to avoid DLPack-backed pointer lifetime hazards.

Add explicit tiled element-stride coverage and acknowledge the DLPack include-layout compatibility follow-up in NVIDIA/cccl#7871.

Made-with: Cursor
@cpcloud cpcloud force-pushed the rparolin/tma_feature branch from f4875f6 to 96a3e84 on March 7, 2026 02:28
@cpcloud (Contributor) commented Mar 7, 2026

/ok to test

Probe support in the fixture and skip when cuda.core is built without CUDA 13 im2col-wide support or when the driver/GPU reports CUDA_ERROR_INVALID_VALUE, so unsupported RTXPRO6000 lanes don't block unrelated changes.

Made-with: Cursor
@cpcloud (Contributor) commented Mar 7, 2026

/ok to test

…safety.

Expose only TensorMapDescriptor in cuda.core, add StridedMemoryView.as_tensor_map(), remove redundant tensor-map fallback packing, and track/check descriptor context/device compatibility before replacement and kernel launch argument packing.

Made-with: Cursor
@cpcloud (Contributor) commented Mar 7, 2026

/ok to test

Bring back the cu12 feature blocks so pixi can parse the manifest and local test commands no longer fail early with a missing feature error.

Made-with: Cursor
@cpcloud (Contributor) commented Mar 7, 2026

/ok to test

@cpcloud cpcloud requested a review from leofang March 7, 2026 16:54
Reject CUDA device-local tensors from a different GPU while still allowing CUDA host and managed memory.

Add regression tests for descriptor creation, replace_address, and the shared validation helper.
@rparolin (Collaborator, Author) commented:
/ok to test

@rparolin rparolin requested a review from cpcloud March 11, 2026 16:29
@cpcloud (Contributor) commented Mar 12, 2026

@leofang Please review. This is now blocked on your review.

@rparolin rparolin enabled auto-merge (squash) March 12, 2026 22:25
@leofang leofang disabled auto-merge March 13, 2026 20:22
Comment on lines +19 to +24
# if __has_include(<dlpack/dlpack.h>)
# include <dlpack/dlpack.h>
# define CUDA_CORE_HAS_DLPACK_H 1
# else
# define CUDA_CORE_HAS_DLPACK_H 0
# endif
@leofang (Member) commented Mar 13, 2026

Q: we guarantee to have a dlpack.h during build time, but it's not accessible via <dlpack/dlpack.h>, so does it mean we end up with CUDA_CORE_HAS_DLPACK_H == 0?

@rparolin (Collaborator, Author) replied:

Yes. Always defining the macro to either 0 or 1 is generally good practice in C++ code, since you can then test its value directly instead of first checking whether the macro is defined:

#if CUDA_CORE_HAS_DLPACK_H == 0
  // fallback path when dlpack.h is unavailable
#endif


#if defined(__has_include)
# if __has_include(<cuda/tma>)
# include <cuda/tma>
Member comment:

Q: I am confused -- The TMA header was added fairly recently. We build cuda.core against both CUDA 12 & 13 and merge the resulting wheels. Without vendoring the CCCL header, how did we manage to build and make CI green? 🤔

Member reply:

It's only available via the CTK as of CUDA 13.2, which means that before this week it was not there.
https://nvidia.github.io/cccl/unstable/libcudacxx/extended_api/tma/make_tma_descriptor.html

@rparolin (Collaborator, Author) replied:

The dependent code is only compiled in when the header is available: CUDA_CORE_HAS_CUDA_TMA.

        vector.vector[void*]& data_addresses,
        TensorMapDescriptor arg,
        const size_t idx) except -1:
    arg._check_context_compat()
Member comment:

I am torn on this. I understand the rationale, but this call will be very slow since it involves multiple driver API calls.

We should check this when a TMA is constructed in Python, memoize the device/context attributes, and then move on. For example, we don't do pointer attribute check at launch time either. It adds just too much overhead.

Comment on lines +139 to +145
    # Allocate a temporary buffer for the 128-byte CUtensorMap struct.
    # We copy rather than pointing directly at arg._tensor_map for lifetime
    # safety: ParamHolder owns and frees its argument buffers independently.
    cdef void* ptr = PyMem_Malloc(sizeof(cydriver.CUtensorMap))
    if ptr is NULL:
        raise MemoryError("Failed to allocate memory for CUtensorMap")
    memcpy(ptr, arg._get_data_ptr(), sizeof(cydriver.CUtensorMap))
Member comment:
This is unnecessary because the driver will copy again. Just pack the pointer and pass it to cuLaunchKernel, and the driver will copy it over.
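The mechanics can be illustrated with ctypes (a sketch: cuLaunchKernel takes a `void** kernelParams` array of pointers to each argument's storage, and the driver copies the pointed-to bytes itself during launch):

```python
import ctypes

# Stand-in for the descriptor's existing 128-byte CUtensorMap storage.
tensor_map = (ctypes.c_byte * 128)()

# Pack the address of that storage directly -- no heap copy needed, as long
# as the descriptor outlives the launch call (which ParamHolder can
# guarantee by simply holding a reference to the descriptor object).
kernel_params = (ctypes.c_void_p * 1)(ctypes.addressof(tensor_map))
```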

Comment on lines +87 to +107
    // Initialise a single-phase mbarrier (1 arriving thread).
    asm volatile(
        "mbarrier.init.shared.b64 [%0], 1;"
        :: "r"((unsigned)__cvta_generic_to_shared(&mbar)));

    // Ask TMA to copy TILE_SIZE floats starting at element 'tile_start'
    // from the tensor described by 'tensor_map' into shared memory.
    asm volatile(
        "cp.async.bulk.tensor.1d.shared::cluster.global.tile"
        ".mbarrier::complete_tx::bytes"
        " [%0], [%1, {%2}], [%3];"
        :: "r"((unsigned)__cvta_generic_to_shared(smem)),
           "l"(&tensor_map),
           "r"(tile_start),
           "r"((unsigned)__cvta_generic_to_shared(&mbar)));

    // Tell the mbarrier how many bytes the TMA will deliver.
    asm volatile(
        "mbarrier.arrive.expect_tx.shared.b64 _, [%0], %1;"
        :: "r"((unsigned)__cvta_generic_to_shared(&mbar)),
           "r"((unsigned)(TILE_SIZE * sizeof(float))));
Member comment:
ditto

Comment on lines +115 to +121
    asm volatile(
        "{ .reg .pred P; \n"
        "WAIT: \n"
        "  mbarrier.try_wait.parity.shared.b64 P, [%0], 0; \n"
        "  @!P bra WAIT; \n"
        "} \n"
        :: "r"((unsigned)__cvta_generic_to_shared(&mbar)));
Member comment:
ditto

Member comment:
I now think this example should just be combined with tma_tensor_map.py, since we have lots of code repetition here.

rparolin and others added 3 commits March 13, 2026 14:20
Co-authored-by: Leo Fang <leo80042@gmail.com>
Co-authored-by: Leo Fang <leo80042@gmail.com>
Co-authored-by: Leo Fang <leo80042@gmail.com>
Development

Successfully merging this pull request may close these issues:

  • Design the TensorMap object
  • EPIC: Support TMA descriptor